All the results were filtered by the adj.p-value < 0.05 and logFC >= 0.3
## An object of class Seurat
## 48361 features across 12617 samples within 1 assay
## Active assay: RNA (48361 features, 0 variable features)
For single cell methods I am going to consider both Seurat and Scanpy
For Scanpy I take in account the methods: - t-test_overestim_var - wilcoxon
Unfortunately the logistic regression provided by scanpy does not give LogFoldChanges and will not be taken in account.
## INFO [2023-01-13 11:45:27] $x
## INFO [2023-01-13 11:45:27] list(scanpy_t$Symbol, scanpy_wx$Symbol)
## INFO [2023-01-13 11:45:27]
## INFO [2023-01-13 11:45:27] $category.names
## INFO [2023-01-13 11:45:27] c("scanpy_t", "scanpy_wx")
## INFO [2023-01-13 11:45:27]
## INFO [2023-01-13 11:45:27] $filename
## INFO [2023-01-13 11:45:27] NULL
## INFO [2023-01-13 11:45:27]
## INFO [2023-01-13 11:45:27] $disable.logging
## INFO [2023-01-13 11:45:27] T
## INFO [2023-01-13 11:45:27]
The t-test_overestim_var recognized as significative almost all the genes and include almost all the genes from the wilcoxon test. Nolw let’s evaluate the expression of the genes.
Top upregulated genes in Exhausted NK cells:
Top downregulated genes in Exhausted NK cells:
Genes with lower abs(logFC)
Apparently Scanpy has a bias evaluating the logFC of
genes that are completely missing in one of the conditions.
But The genes are correctly plotted in the report. So I will check the
expression by the z-score instead of the LogFC:
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
Better result for the extremes logFC (scores), but still not clear results for the low scored genes. Indeed those genes are expressed by a small proportion of cells:
## Symbol pvals_adj logfoldchanges abs_lfc scores NK.exhausted
## 43 Z83847.1 0.04939069 -21.810220 21.810220 -3.404695 0.000000000
## 44 AL603910.1 0.04926814 -21.742660 21.742660 -3.406065 0.000000000
## 45 RALGPS1 0.04926814 -2.078307 2.078307 -3.407628 0.001892148
## NK.resident
## 43 0.002083333
## 44 0.002083333
## 45 0.008333333
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
GO terms can be linked to immune cell activity, down terms (up for resident NK cells) are correlated with translation regulation.
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
Better differentially expressed genes. GO looks quite similar.
For Seurat I take in account the methods: - t-test - wilcoxon - logistic regression
Similar numbers of DE genes were found by Seurat’s different flavours.
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
Better result with seurat, the differences of expression are small (same cell types, just different state). In average the different gene expression is better appreciable
Enriched GO terms from upregulated genes are specific of immune activation. Downregulated GO terms are inherent to response to various factor, could it be a hint of the missing response to the cancer cell in exhausted NK cell?
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
Similar to the previous
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
As above.
In the pseudobulks, counts were summed for each celltype and sample, in practice obtaining a column of counts for each celltype and sample. Then, genes that weren’t showing counts in more than 90% of the cells were removed. Also, genes showing expression below 10 counts in all the pseudobulks were removed (pseudobulks are the sum of cells counts. showing less than 10 counts indicate almost all the cells didn’t had count’s for that gene).
## [1] 8390 67
For DESeq2, firstly I run it with the “LRT” methods that showed better results here. The PCA plot and clustering clustering are mixed, probably because the division based on cell type here. I run DESeq2 specifing the model to take account of the batches.
Then, I tested combat_seq to correct the batch effect here. Also, I decided
to see if there was a batch effect afrom unknown sources using the
single variable analysis (sva) here.
## INFO [2023-01-13 11:45:31] $x
## INFO [2023-01-13 11:45:31] list(deseq2$Symbol, deseq2_combat$Symbol, deseq2_sva$Symbol)
## INFO [2023-01-13 11:45:31]
## INFO [2023-01-13 11:45:31] $category.names
## INFO [2023-01-13 11:45:31] c("deseq2", "deseq2_combat", "deseq2_sva")
## INFO [2023-01-13 11:45:31]
## INFO [2023-01-13 11:45:31] $filename
## INFO [2023-01-13 11:45:31] NULL
## INFO [2023-01-13 11:45:31]
## INFO [2023-01-13 11:45:31] $disable.logging
## INFO [2023-01-13 11:45:31] T
## INFO [2023-01-13 11:45:31]
DESeq2 without batch correction found 147 DE genes,
sva
that corrected by batch effect without prion knowledge found alomt 1000
genes Combat correction found only 49 genes.
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Downregulated GO terms"
Differencies in the expression aren’t clear. No up-regulated pathways where found. Downregulated are similar to Seurat
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
Result from combat are more clear in DGE. The Go terms looks similar to previous analysis
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
Similarly to DESeq2, the genes differencies in expression aren’t clear. GO terms similar to seurat/deseq2
Normal and with combat
## INFO [2023-01-13 11:45:31] $x
## INFO [2023-01-13 11:45:31] list(limma$Symbol, limma_combat$Symbol)
## INFO [2023-01-13 11:45:31]
## INFO [2023-01-13 11:45:31] $category.names
## INFO [2023-01-13 11:45:31] c("limma", "limma_combat")
## INFO [2023-01-13 11:45:31]
## INFO [2023-01-13 11:45:31] $filename
## INFO [2023-01-13 11:45:31] NULL
## INFO [2023-01-13 11:45:31]
## INFO [2023-01-13 11:45:31] $disable.logging
## INFO [2023-01-13 11:45:31] T
## INFO [2023-01-13 11:45:31]
Limma found more than 600 genes, no differences with the batch corrected data
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
high LogFC aren’t well apreciable
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
Down-regulated GO terms more correlated to translation regulation
I tested edgeR with LRT as best option from here, I tested combat too. edgeR edgeR_combat
## INFO [2023-01-13 11:43:34] $x
## INFO [2023-01-13 11:43:34] list(edger$Symbol, edger_combat$Symbol)
## INFO [2023-01-13 11:43:34]
## INFO [2023-01-13 11:43:34] $category.names
## INFO [2023-01-13 11:43:34] c("edgeR", "edgeR_combat")
## INFO [2023-01-13 11:43:34]
## INFO [2023-01-13 11:43:34] $filename
## INFO [2023-01-13 11:43:34] NULL
## INFO [2023-01-13 11:43:34]
## INFO [2023-01-13 11:43:34] $disable.logging
## INFO [2023-01-13 11:43:34] T
## INFO [2023-01-13 11:43:34]
aroung 500 genes found by edgeR
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
High and low LogFC aren’t clear. Low abs(logFC) are better appreciable. BP terms contains both response related patways and translation regulation pathways
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Upregulated GO terms"
## [1] "Downregulated GO terms"
DE genes similar to EdgeR. BP translation-related
Tested a non-parametric test with a handwritten wilcoxon test (inspired from this)
## [1] "Top upregulated genes"
## [1] "Top downregulated genes"
## [1] "lowest abs(LFC) genes"
## [1] "Upregulated GO terms"
Differentially expressed genes not clear, BP (only up-regulated) similar to above.